A Comprehensive Survey of Automated Audio Captioning
Automated audio captioning, a task that mimics human perception and
innovatively links audio processing with natural language processing, has
seen much progress over the last few years. Audio captioning requires
recognizing the acoustic scene, the primary audio events, and sometimes the
spatial and temporal relationships between events in an audio clip. It also
requires describing these elements in a fluent and vivid sentence. Deep
learning-based approaches are widely adopted to tackle this problem. This
paper provides a comprehensive review covering the benchmark datasets,
existing deep learning techniques, and the evaluation metrics in automated
audio captioning.
Enhance Temporal Relations in Audio Captioning with Sound Event Detection
Automated audio captioning aims at generating natural language descriptions
for given audio clips, not only detecting and classifying sounds, but also
summarizing the relationships between audio events. Recent research advances in
audio captioning have introduced additional guidance to improve the accuracy of
audio events in generated sentences. However, temporal relations between audio
events have received little attention, even though revealing such complex
relations is a key component of summarizing audio content. Therefore, this
paper aims to better capture temporal relationships in caption generation with
sound event detection (SED), a task that locates events' timestamps. We
investigate the best approach to integrating temporal information into a
captioning model and propose a temporal tag system that transforms the
timestamps into comprehensible relations. Results evaluated by the proposed
temporal metrics suggest that great improvement is achieved in temporal
relation generation.
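The abstract does not spell out the temporal tag system itself; a minimal sketch, assuming SED output as (start, end) interval pairs and a hypothetical three-tag vocabulary ("before", "after", "while") that is illustrative rather than the paper's actual design:

```python
# Hypothetical sketch: map SED timestamps to coarse temporal relation tags.
# The tag vocabulary and thresholds are assumptions, not the paper's system.

def temporal_tag(event_a, event_b):
    """Each event is a (start, end) timestamp pair in seconds."""
    a_start, a_end = event_a
    b_start, b_end = event_b
    if a_end <= b_start:
        return "before"   # A finishes before B starts
    if b_end <= a_start:
        return "after"    # A starts after B finishes
    return "while"        # the two intervals overlap

# Usage: a dog bark at 0-2 s relative to speech at 3-5 s
print(temporal_tag((0.0, 2.0), (3.0, 5.0)))  # -> before
```

Tags of this kind could then be fed to the captioning model alongside the event labels, turning raw timestamps into relations a language model can verbalize.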
Improving Audio Caption Fluency with Automatic Error Correction
Automated audio captioning (AAC) is an important cross-modal translation
task that aims at generating descriptions for audio clips. However, captions
generated by previous AAC models suffer from ``false-repetition'' errors due to
the training objective. To reduce such errors, we propose a new task of AAC
error correction that post-processes AAC outputs. To tackle this problem, we
use observation-based rules to corrupt error-free captions, generating pseudo
grammatically-erroneous sentences. Each pair of corrupted and clean sentences
can thus be used for training. We train a neural network-based model on the
synthetic error dataset and apply it to correct real errors in AAC outputs.
Results on two benchmark datasets indicate that our approach significantly
improves fluency while maintaining semantic information.

Comment: Accepted by NCMMSC 202
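The abstract does not detail its observation-based corruption rules; a minimal sketch of one plausible rule, injecting a false repetition into a clean caption to produce a (corrupted, clean) training pair. The rule and function name are assumptions for illustration, not the paper's actual rule set:

```python
import random

# Hypothetical corruption rule: duplicate a short word span to mimic the
# "false-repetition" errors described in the abstract. Real rules would be
# derived from observed AAC model outputs.

def inject_false_repetition(caption, rng=None):
    rng = rng or random.Random(0)  # fixed seed for reproducible examples
    words = caption.split()
    start = rng.randrange(len(words))
    span = words[start:start + 2]          # pick a span of up to two words
    corrupted = words[:start + 2] + span + words[start + 2:]
    return " ".join(corrupted)             # span now appears twice in a row

clean = "a dog barks while a car passes by"
pair = (inject_false_repetition(clean), clean)  # synthetic training pair
```

A correction model trained on many such pairs can then be applied to real AAC outputs, learning to delete the spurious repetition while leaving the rest of the sentence intact.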
BLAT: Bootstrapping Language-Audio Pre-training based on AudioSet Tag-guided Synthetic Data
Compared with the ample research on visual-text pre-training, few works explore
audio-text pre-training, mostly due to the lack of sufficient parallel
audio-text data. Most existing methods incorporate the visual modality as a
pivot for audio-text pre-training, which inevitably introduces data noise. In
this paper, we propose BLAT: Bootstrapping Language-Audio pre-training based on
Tag-guided synthetic data. We utilize audio captioning to generate text
directly from audio, without the aid of the visual modality, so that potential
noise from modality mismatch is eliminated. Furthermore, we propose caption
generation under the guidance of AudioSet tags, leading to more accurate
captions. With these two improvements, we curate high-quality, large-scale
parallel audio-text data, on which we perform audio-text pre-training.
Evaluation on a series of downstream tasks indicates that BLAT achieves SOTA
zero-shot classification performance on most datasets and significant
performance improvements when fine-tuned on downstream tasks, demonstrating
the effectiveness of our synthetic data.
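The abstract does not spell out how zero-shot classification works; in contrastive audio-text models it is typically done by comparing an audio embedding against text embeddings of the class names. A minimal sketch under that assumption, with toy stand-in encoders rather than BLAT's actual pre-trained models:

```python
import numpy as np

# Hypothetical sketch of zero-shot classification with an audio-text model.
# `text_encoder` stands in for a real pre-trained text encoder.

def zero_shot_classify(audio_emb, class_names, text_encoder):
    """Return the class whose text embedding is most similar to the audio."""
    text_embs = np.stack([text_encoder(c) for c in class_names]).astype(float)
    # Cosine similarity: normalize both sides, then take dot products.
    text_embs /= np.linalg.norm(text_embs, axis=1, keepdims=True)
    audio_emb = audio_emb / np.linalg.norm(audio_emb)
    scores = text_embs @ audio_emb
    return class_names[int(np.argmax(scores))]

# Toy stand-in encoder for illustration only.
toy = {"dog bark": np.array([1.0, 0.0]), "siren": np.array([0.0, 1.0])}
print(zero_shot_classify(np.array([0.9, 0.1]), ["dog bark", "siren"], toy.get))
# -> dog bark
```

Because no classifier head is trained, any list of class names can be supplied at inference time, which is what makes the evaluation "zero-shot".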
Sound-Based Construction Activity Monitoring with Deep Learning
Automated construction monitoring assists site managers in managing safety, schedule, and productivity effectively. Existing research focuses on identifying construction sounds to determine the type of construction activity. However, there are two major limitations: the inability to handle a mixed sound environment in which multiple construction activity sounds occur simultaneously, and the inability to precisely locate the start and end times of each individual construction activity. This research aims to fill this gap by developing an innovative deep learning-based method. The proposed model combines the benefits of a Convolutional Neural Network (CNN) for extracting features and a Recurrent Neural Network (RNN) for leveraging contextual information to handle construction environments with polyphony and noise. In addition, the dual-threshold output permits exact identification of the start and finish times of individual construction activities. Before training and testing with construction sounds collected from a modular construction factory, the model was pre-trained with publicly available general sound event data. All of the design innovations have been confirmed by an ablation study, and two extended experiments were also performed to verify the versatility of the model in additional construction environments and activities. This model has great potential for autonomous monitoring of construction activities.
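The dual-threshold output described above can be sketched as a post-processing step on per-frame activity probabilities: an event starts when the probability crosses a high threshold and ends only when it falls below a lower one. The threshold values and function name here are illustrative assumptions, not the paper's actual settings:

```python
# Hypothetical sketch of dual-threshold event localization: the high threshold
# triggers onset, the low threshold sustains the event until release, which
# stabilizes start/end estimates against probability jitter.

def dual_threshold_events(probs, high=0.75, low=0.25):
    """probs: per-frame activity probabilities. Returns (start, end) frame pairs."""
    events, start = [], None
    for i, p in enumerate(probs):
        if start is None and p >= high:
            start = i                      # onset: crossed the high threshold
        elif start is not None and p < low:
            events.append((start, i))      # offset: fell below the low threshold
            start = None
    if start is not None:                  # event still active at clip end
        events.append((start, len(probs)))
    return events

print(dual_threshold_events([0.1, 0.8, 0.9, 0.5, 0.2, 0.1]))  # -> [(1, 4)]
```

Note how the 0.5 frame keeps the event alive even though it is below the onset threshold; a single-threshold scheme would have split it into two spurious events.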
Comparison Between Different Ways in Making Silicon Dioxide Layer on Silicon Wafers
The most important raw material for power devices is the epitaxial wafer, which is made from heavily doped, polished silicon wafers. To prevent the dopant introduced into the silicon during ingot pulling from diffusing into the epitaxial atmosphere and ultimately into the epitaxial layer, a silicon dioxide layer is added to the back surface of the silicon wafers during polished-wafer production. Two methods are mainly used to form this silicon dioxide layer: one is High Temperature Oxide (HTO) deposition, and the other is Atmospheric Pressure Chemical Vapor Deposition (APCVD). Process parameters such as temperature, time, and pressure differ considerably between the two methods, as do the characteristics of the resulting silicon dioxide layers and their effects on the wafers. In this article we compare the two silicon dioxide layers and the methods used to make them.